Adaptive Feature Abstraction for Translating Video to Text
نویسندگان
چکیده
Previous models for video captioning often use the output from a specific layer of a Convolutional Neural Network (CNN) as video features. However, the variable contextdependent semantics in the video may make it more appropriate to adaptively select features from the multiple CNN layers. We propose a new approach for generating adaptive spatiotemporal representations of videos for the captioning task. A novel attention mechanism is developed, that adaptively and sequentially focuses on different layers of CNN features (levels of feature “abstraction”), as well as local spatiotemporal regions of the feature maps at each layer. The proposed approach is evaluated on three benchmark datasets: YouTube2Text, M-VAD and MSR-VTT. Along with visualizing the results and how the model works, these experiments quantitatively demonstrate the effectiveness of the proposed adaptive spatiotemporal feature abstraction for translating videos to sentences with rich semantics.
منابع مشابه
Adaptive Feature Abstraction for Translating Video to Language
Previous models for video captioning often use the output from a specific layer of a Convolutional Neural Network (CNN) as video representations, preventing them from modeling rich, varying context-dependent semantics in video descriptions. In this paper, we propose a new approach to generating adaptive spatiotemporal representations of videos for a captioning task. For this purpose, novel atte...
متن کاملVideo Abstraction in H.264/AVC Compressed Domain
Video abstraction allows searching, browsing and evaluating videos only by accessing the useful contents. Most of the studies are using pixel domain, which requires the decoding process and needs more time and process consuming than compressed domain video abstraction. In this paper, we present a new video abstraction method in H.264/AVC compressed domain, AVAIF. The method is based on the norm...
متن کاملAdaptive Spectral Separation Two Layer Coding with Error Concealment for Cell Loss Resilience
This paper addresses the issue of cell loss and its consequent effect on video quality in a packet video system, and examines possible compensative measures. In the system's enconder, adaptive spectral separation is used to develop a two-layer coding scheme comprising a high priority layer to carry essential video data and a low priority layer with data to enhance the video image. A two-step er...
متن کاملAn Efficient Adaptive Boundary Matching Algorithm for Video Error Concealment
Sending compressed video data in error-prone environments (like the Internet and wireless networks) might cause data degradation. Error concealment techniques try to conceal the received data in the decoder side. In this paper, an adaptive boundary matching algorithm is presented for recovering the damaged motion vectors (MVs). This algorithm uses an outer boundary matching or directional tempo...
متن کاملTranslating Images to Words for Recognizing Objects in Large Image and Video Collections
We present a new approach to the object recognition problem, motivated by the recent availability of large annotated image and video collections. This approach considers object recognition as the translation of visual elements to words, similar to the translation of text from one language to another. The visual elements represented in feature space are categorized into a finite set of blobs. Th...
متن کامل